Abstract

We propose a method for statistical analysis of time series that allows
us to obtain solutions to some classical problems of mathematical statistics
under the only assumption that the process generating the data is
stationary ergodic. Namely, we consider three problems: goodness-of-fit
(or identity) testing, process classification, and the change point problem.
For each of the problems we construct a test that is asymptotically
accurate for the case when the data is generated by stationary ergodic
processes. The tests are based on empirical estimates of distributional
distance.
1 Introduction



Overview. In this work we consider the problem of statistical analysis of
time series, when nothing is known about the underlying process generating the
data, except that it is stationary ergodic. There is a vast literature on time
series analysis under various parametric assumptions, and also under such
non-parametric assumptions as that the process is finite-memory or has certain
mixing rates. While under these settings most of the problems of statistical
analysis are clearly solvable and efficient algorithms exist, in the general setting
of stationary ergodic processes it is far less clear what can be done in principle:
which problems of statistical analysis admit a solution and which do not. In this
work we propose a method of statistical analysis of time series that allows us
to demonstrate that some classical statistical problems indeed admit a solution
under the only assumption that the data is stationary ergodic, whereas before
solutions only for more restricted cases were known. The solutions are always
constructive, that is, we present asymptotically accurate algorithms for each of
the considered problems. All the algorithms are based on empirical estimates of
distributional distance, which is at the core of the suggested approach. We
suggest that the proposed approach can be applied to other problems of statistical
analysis of time series, with a view to establishing principled positive results,
leaving the task of finding optimal algorithms for each particular problem as a
topic for further research.

Some preliminary results appear in [16].

Here we concentrate on the following three problems: goodness-of-fit (or
identity) testing, process classification, and the change point problem.

Identity testing. The first problem is the following problem of hypothesis
testing. A stationary ergodic process distribution $p$ is known theoretically. Given
a data sample, it is required to test whether it was generated by $p$, versus
it was generated by any other stationary ergodic distribution that is different
from $p$ (goodness-of-fit, or identity testing). The case of i.i.d. or finite-memory
processes was widely studied (see e.g. [5]); in particular, when $p$ has a finite
memory [TS] proposes a test against any stationary ergodic alternative: a test
that can be based on an arbitrary universal code. It was noted in [17] that an
asymptotically accurate test for the case of stationary ergodic processes over
a finite alphabet exists (but no test was proposed). Here we propose a concrete
and simple asymptotically accurate goodness-of-fit test, which demonstrates the
proposed approach: to use empirical distributional distance for hypothesis
testing. By an asymptotically accurate test we mean the following. First, the Type
I error of the test (or its size) is fixed and is given as a parameter to the test.
That is, given any $\alpha > 0$ as an input, under $H_0$ (that is, if the data sample was
indeed generated by $p$) the probability that the test says "$H_1$" is not greater
than $\alpha$. Second, under any hypothesis in $H_1$ (that is, if the distribution
generating the data is different from $p$), the test will say "$H_0$" not more than a finite
number of times, with probability 1. In other words, the Type I error of the
test is fixed and Type II errors occur not more than a finite number
of times, as the data sample increases, with probability 1 under any stationary
ergodic alternative.

A comment on this setting is in order. When the alternative $H_1$ is less
general, e.g. distributions that have finite memory [10] or known mixing rates,
one typically seeks a test that has optimal rates of decrease of probability of
Type II error to 0. For our case, when the alternative is the set of all stationary
ergodic processes, the rate of decrease of probability of Type II error is necessarily
non-uniform. In this sense, the property that we establish for our test is the
strongest possible. Observe that it is strictly stronger than requiring that the
test makes only a finite number of errors (either Type I or Type II), the setting
considered, for example, in the cases when $H_0$ is composite, or for the process
classification problem that we address in this work.

Process classification. In the next problem that we consider, we again have
to decide whether a data sample was generated by a process satisfying a
hypothesis $H_0$ or a hypothesis $H_1$. However, here $H_0$ and $H_1$ are not known
theoretically, but are represented by two additional data samples. More
precisely, the problem is that of process classification, which can be formulated as
follows. We are given three samples $X = (X_1, \dots, X_k)$, $Y = (Y_1, \dots, Y_m)$ and
$Z = (Z_1, \dots, Z_n)$ generated by stationary ergodic processes with distributions
$p_X$, $p_Y$ and $p_Z$. It is known that $p_X \ne p_Y$ but either $p_Z = p_X$ or $p_Z = p_Y$. It
is required to test which one is the case. That is, we have to decide whether
the sample $Z$ was generated by the same process as the sample $X$ or by the
same process as the sample $Y$. This problem for the case of dependent time
series was considered for example in [10], where a solution is presented under the
finite-memory assumption. It is closely related to many important problems in
statistics and application areas, such as pattern recognition, classification, etc.
Apparently no asymptotically accurate procedure for process classification has
been known so far for the general case of stationary ergodic processes. Here
we propose a test that converges almost surely to the correct answer. In other
words, the test makes only a finite number of errors with probability 1, with
respect to any stationary ergodic processes generating the data. Unlike in the
previous problem, here we do not explicitly distinguish between Type I and
Type II error, since the hypotheses are by nature symmetric: $H_0$ is "$p_Z = p_X$"
and $H_1$ is "$p_Z = p_Y$".

Change point estimation. Finally, we consider the change point problem. It
is another classical problem, with vast literature on both parametric (see e.g. [2])
and non-parametric (see e.g. [S]) methods for solving it. In this work we address
the case where the data is dependent, the form and structure of the dependence
are unknown, and marginal distributions before and after the change may be the
same. We consider the following (off-line) setting of the problem: a (real-valued)
sample $Z_1, \dots, Z_n$ is given, where $Z_1, \dots, Z_k$ are generated according to some
distribution $p_X$ and $Z_{k+1}, \dots, Z_n$ are generated according to some distribution
$p_Y$ which is different from $p_X$. It is known that the distributions $p_X$ and $p_Y$
are stationary ergodic, but nothing else is known about them. Most literature
on the change point problem for dependent time series assumes that the marginal
distributions before and after the change point are different, and often also makes
explicit restrictions on the dependence, such as requirements on mixing rates.
Non-parametric methods used in these cases are typically based on the
Kolmogorov-Smirnov statistic, the Cramér-von Mises statistic, or generalizations thereof [51 [31 [5].
The main difference of our results is that we do not assume that the
single-dimensional marginals (or finite-dimensional marginals of any given fixed size)
are different, and do not make any assumptions on the structure of dependence.
The only assumption is that the (unknown) process distributions before and
after the change point are stationary ergodic. Our result demonstrates
that asymptotically accurate change point estimation is possible in this general
setting.

Methodology. All the tests that we construct are based on empirical
estimates of the so-called distributional distance. For two processes $p_1$, $p_2$ a
distributional distance is defined as $\sum_{k=1}^{\infty} w_k |p_1(B_k) - p_2(B_k)|$, where $w_k$ are positive
summable real weights, e.g. $w_k = 2^{-k}$, and $B_k$ range over a countable field that
generates the sigma-algebra of the underlying probability space. For example,
if we are talking about finite-alphabet processes with the binary alphabet
$A = \{0, 1\}$, $B_k$ would range over the set $A^* = \cup_{k \in \mathbb{N}} A^k$; that is, over all
tuples 0, 1, 00, 01, 10, 11, 000, 001, ...; therefore, the distributional distance in this case
is the weighted sum of differences of probabilities of all possible tuples. In this
work we consider real-valued processes, $A = \mathbb{R}$, so $B_k$ can be taken to range
over all intervals with rational endpoints, all pairs of such intervals, triples,
etc. Although distributional distance is a natural concept that, for stochastic
processes, has been studied for a while [9], its empirical estimates have not,
to our knowledge, been used for statistical analysis of time series. We argue
that this distance is rather natural for this kind of problem, first of all, since
it can be consistently estimated (unlike, for example, the $\bar{d}$ distance, which
cannot [13] be consistently estimated for the general case of stationary ergodic
processes). Secondly, it is always bounded, unlike (empirical) KL divergence,
which is often used for statistical inference for time series (e.g. [6l [151 [II (3 [H]
and others). Other approaches to statistical analysis of stationary dependent
time series include the use of (universal) codes [HI [151 El. Here we first show
that distributional distance between stationary ergodic processes can be
consistently estimated based on sampling, and then apply it to construct consistent
tests for the three problems of statistical analysis described above.
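
To illustrate the binary-alphabet case, the following minimal sketch computes a truncated distributional distance between two i.i.d. binary processes. The function names, the restriction to i.i.d. processes with parameters `q1` and `q2`, the shortlex enumeration of words, and the cut-off `max_len` are all assumptions made for this illustration; the definition itself uses an infinite sum over all words.

```python
from itertools import product


def word_probability(word, q):
    """Probability of a binary word under an i.i.d. process with P(1) = q."""
    ones = sum(word)
    return q ** ones * (1 - q) ** (len(word) - ones)


def distributional_distance(q1, q2, max_len=5):
    """Truncated sum of w_k * |p1(B_k) - p2(B_k)| with weights w_k = 2^{-k}.

    Words B_k are enumerated by length (0, 1, 00, 01, 10, 11, 000, ...);
    only words of length at most max_len are included.
    """
    total, k = 0.0, 0
    for length in range(1, max_len + 1):
        for word in product((0, 1), repeat=length):
            k += 1
            diff = abs(word_probability(word, q1) - word_probability(word, q2))
            total += 2.0 ** -k * diff
    return total


print(distributional_distance(0.5, 0.5))  # identical processes give 0.0
print(distributional_distance(0.5, 0.25) > 0)  # different processes give a positive value
```

Because the weights decrease geometrically, words late in the enumeration contribute very little, so the truncated value is already close to the full sum.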

Although empirical estimates of the distributional distance involve taking 
an infinite sum, in practice it is obvious that only a finite number of summands 
has to be calculated. This is due to the fact that empirical estimates have to 
be compared to each other or to theoretically known probabilities, and since 
the (bounded) summands have (exponentially) decreasing weights, the result of 
the comparison is known after only finitely many evaluations. Therefore, the 
algorithms presented can be applied in practice. On the other hand, the main 
value of the results is in the demonstration of what is possible in principle; 
finding practically efficient procedures for each of the considered problems is an 
interesting problem for further research.
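
The truncation argument can be made concrete with a short worked bound. It assumes the weights $w_i = 2^{-i}$ (one admissible choice) and uses notation introduced only for this illustration: $a_i, b_i \in [0, 1]$ are the bounded summands of the two estimates being compared, and $S_J(a)$ is the partial sum of the first $J$ terms.

```latex
S_J(a) := \sum_{i=1}^{J} 2^{-i} a_i ,
\qquad
0 \le \sum_{i=1}^{\infty} 2^{-i} a_i - S_J(a)
  \le \sum_{i=J+1}^{\infty} 2^{-i} = 2^{-J} .
% Hence the comparison of the two infinite sums is decided as soon as
% the partial sums differ by more than the remaining tail:
S_J(a) > S_J(b) + 2^{-J}
\;\Longrightarrow\;
\sum_{i=1}^{\infty} 2^{-i} a_i > \sum_{i=1}^{\infty} 2^{-i} b_i .
```

Under these assumptions, whenever the two full sums differ, some finite $J$ settles the comparison, since the tail $2^{-J}$ shrinks to 0 while the gap between the partial sums converges to a non-zero limit.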

2 Preliminaries 

We are considering (stationary ergodic) processes with the alphabet $A = \mathbb{R}$.
The generalization to $A = \mathbb{R}^d$ is straightforward; moreover, the results can be
extended to the case when $A$ is a complete separable metric space. We use
the symbol $A^*$ for $\cup_{i=1}^{\infty} A^i$. Elements of $A^*$ are called words or sequences. For
each $k \in \mathbb{N}$, let $B^k$ be the set of all cylinders of the form $A_1 \times \dots \times A_k$ where
$A_i \subset A$ are intervals with rational endpoints. Let $\mathcal{B} = \cup_{k=1}^{\infty} B^k$; since this
set is countable we can introduce an enumeration $\mathcal{B} = \{B_i : i \in \mathbb{N}\}$. The set
$\{B_i \times A^{\infty} : i \in \mathbb{N}\}$ generates the Borel $\sigma$-algebra on $\mathbb{R}^{\infty} = A^{\infty}$. For a set $B \in \mathcal{B}$
let $|B|$ be the index $k$ of the set $B^k$ that $B$ comes from: $|B| = k : B \in B^k$.

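For a sequence $X \in A^n$ and a set $B \in \mathcal{B}$ denote by $\nu(X, B)$ the frequency with
which the sequence $X$ falls in the set $B$:

$$\nu(X, B) := \begin{cases} \dfrac{1}{n - |B| + 1} \sum_{i=1}^{n - |B| + 1} I_{\{(X_i, \dots, X_{i+|B|-1}) \in B\}} & \text{if } n \ge |B|, \\ 0 & \text{otherwise,} \end{cases}$$

where $X = (X_1, \dots, X_n)$. For example,

$$\nu\big((0.5, 1.5, 1.2, 1.4, 2.1), ([1.0, 2.0] \times [1.0, 2.0])\big) = 1/2.$$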
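
The frequency can be computed with a sliding window over the sample. The following is a minimal sketch; the function name `frequency` and the representation of a cylinder set as a list of closed intervals `(lo, hi)`, one per coordinate, are assumptions made for illustration.

```python
def frequency(sample, box):
    """Return the fraction of length-|B| windows of `sample` that fall in `box`.

    `box` is a list of closed intervals (lo, hi), one per coordinate
    (an assumed representation of a cylinder set B).
    """
    k = len(box)
    windows = len(sample) - k + 1
    if windows <= 0:  # the sample is shorter than |B|
        return 0.0
    hits = sum(
        all(lo <= x <= hi for x, (lo, hi) in zip(sample[i:i + k], box))
        for i in range(windows)
    )
    return hits / windows


# The example above: two of the four length-2 windows fall in the box.
print(frequency([0.5, 1.5, 1.2, 1.4, 2.1], [(1.0, 2.0), (1.0, 2.0)]))  # 0.5
```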

We use the symbol $S$ for the set of all stationary ergodic processes on $A^{\infty}$.
The ergodic theorem (see e.g. [3]) implies that for any process $p \in S$ generating
a sequence $X_1, X_2, \dots$ the frequency of observing a tuple that falls into each
$B \in \mathcal{B}$ tends to its limiting (or a priori) probability a.s.:

$$\nu((X_1, \dots, X_n), B) \to p((X_1, \dots, X_{|B|}) \in B)$$

as $n \to \infty$. We will often abbreviate $p((X_1, \dots, X_{|B|}) \in B) =: p(B)$.

Definition 1 (distributional distance). The distributional distance is defined
for a pair of processes $p_1$, $p_2$ as follows:

$$d(p_1, p_2) = \sum_{i=1}^{\infty} w_i |p_1(B_i) - p_2(B_i)|, \tag{1}$$

where $w_i$ are summable positive real weights (e.g. $w_k = 2^{-k}$).

It is easy to see that $d$ is a metric. The reader is referred to [9] for more
information about $d$ and its properties.

Definition 2 (empirical distributional distance). For $X, Y \in A^*$, define the
empirical distributional distance $d(X, Y)$ as

$$d(X, Y) := \sum_{i=1}^{\infty} w_i |\nu(X, B_i) - \nu(Y, B_i)|. \tag{2}$$

Similarly, we can define the empirical distance when only one of the process
measures is unknown:

$$d(X, p) := \sum_{i=1}^{\infty} w_i |\nu(X, B_i) - p(B_i)|, \tag{3}$$

where $p \in S$ and $X \in A^*$.
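
A truncated version of the empirical distance between two samples can be sketched as follows. The finite family of sets used here (products of the half-open intervals between consecutive points of `cuts`, up to dimension `max_dim`) is a hypothetical stand-in for the full countable enumeration, and the names `frequency` and `empirical_distance` are illustrative; the weights are $w_i = 2^{-i}$.

```python
from itertools import product


def frequency(sample, box):
    """Frequency of windows of `sample` in a box of half-open intervals [lo, hi)."""
    k = len(box)
    windows = len(sample) - k + 1
    if windows <= 0:
        return 0.0
    hits = sum(
        all(lo <= x < hi for x, (lo, hi) in zip(sample[i:i + k], box))
        for i in range(windows)
    )
    return hits / windows


def empirical_distance(xs, ys, cuts=(0.0, 0.5, 1.0), max_dim=2):
    """Truncated empirical distance: finitely many boxes B_i, weights 2^{-i}."""
    intervals = list(zip(cuts[:-1], cuts[1:]))
    boxes = [b for k in range(1, max_dim + 1) for b in product(intervals, repeat=k)]
    return sum(
        2.0 ** -i * abs(frequency(xs, b) - frequency(ys, b))
        for i, b in enumerate(boxes, start=1)
    )


xs = [0.2, 0.8] * 4            # alternating pattern
ys = [0.2, 0.2, 0.8, 0.8] * 2  # same one-dimensional frequencies, different pairs
print(empirical_distance(xs, xs))      # 0.0
print(empirical_distance(xs, ys) > 0)  # True: only the two-dimensional sets differ
```

In this toy example the two samples have identical frequencies on single intervals, so the distance is positive only because pairs of intervals are included in the enumeration.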

The following lemma will play a key role in establishing the main results. 

Lemma 1. Let two samples $X = (X_1, \dots, X_k)$ and $Y = (Y_1, \dots, Y_m)$ be
generated by stationary ergodic processes $p_X$ and $p_Y$ respectively. Then

(i) $\lim_{k, m \to \infty} d(X, Y) = d(p_X, p_Y)$ a.s.

(ii) $\lim_{k \to \infty} d(X, p_Y) = d(p_X, p_Y)$ a.s.



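Proof. For any $\varepsilon > 0$ we can find an index $J$ such that $\sum_{i > J} w_i < \varepsilon/4$.
Moreover, for each $j$ we have $\nu((X_1, \dots, X_k), B_j) \to p_X(B_j)$ a.s., so that

$$|\nu((X_1, \dots, X_k), B_j) - p_X(B_j)| < \varepsilon/(4 J w_j)$$

from some step $k$ on; define $K_j := k$. Let $K := \max_{j \le J} K_j$ ($K$ depends on the
realization $X_1, X_2, \dots$). Define analogously $M$ for the sequence $(Y_1, \dots, Y_m, \dots)$.
Since every summand is bounded by 1, the terms with $i > J$ contribute at most
$2 \sum_{i > J} w_i < \varepsilon/2$. Thus for $k > K$ and $m > M$ we have

$$\begin{aligned}
|d(X, Y) - d(p_X, p_Y)|
&= \Big| \sum_{i=1}^{\infty} w_i \big( |\nu(X, B_i) - \nu(Y, B_i)| - |p_X(B_i) - p_Y(B_i)| \big) \Big| \\
&\le \sum_{i=1}^{\infty} w_i \big( |\nu(X, B_i) - p_X(B_i)| + |\nu(Y, B_i) - p_Y(B_i)| \big) \\
&\le \sum_{i=1}^{J} w_i \big( |\nu(X, B_i) - p_X(B_i)| + |\nu(Y, B_i) - p_Y(B_i)| \big) + \varepsilon/2 \\
&\le \sum_{i=1}^{J} w_i \big( \varepsilon/(4 J w_i) + \varepsilon/(4 J w_i) \big) + \varepsilon/2 = \varepsilon,
\end{aligned}$$

which proves the first statement. The second statement can be proven
analogously. □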

3 Main results 

3.1 Goodness-of-fit Test 

For a given stationary ergodic process measure $p$ and a sample $X = (X_1, \dots, X_n)$
we wish to test the hypothesis $H_0$ that the sample was generated by $p$ versus
$H_1$ that it was generated by a stationary ergodic distribution that is different
from $p$. Thus, $H_0 = \{p\}$ and $H_1 = S \setminus H_0$.

Define the set $D^n_\delta$ as the set of all samples of length $n$ that are at least $\delta$-far
from $p$ in empirical distributional distance:

$$D^n_\delta := \{X \in A^n : d(X, p) \ge \delta\}.$$

For each $n$ and each given confidence level $\alpha$ define the critical region $C^n_\alpha$ of the
test as $C^n_\alpha := D^n_\gamma$ where

$$\gamma := \inf\{\delta : p(D^n_\delta) \le \alpha\}.$$

The test rejects $H_0$ at confidence level $\alpha$ if $(X_1, \dots, X_n) \in C^n_\alpha$ and accepts it
otherwise. In words, for each sequence we measure the distance between the
empirical probabilities (frequencies) and the measure $p$ (that is, the theoretical
$p$-probabilities); we then take a largest ball (with respect to this distance) around
$p$ that has $p$-probability not greater than $1 - \alpha$. The test rejects all sequences
outside this ball.
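
The critical value requires $p$-probabilities of sets of samples, which are rarely available in closed form. One practical stand-in, which is not part of the construction above, is to simulate samples from $p$, compute their distances to $p$, and take an empirical threshold. The sketch below assumes such simulated distances are already available in `null_distances`; the function names are hypothetical, and rejection uses a strict inequality, a small simplification relative to the definition of the critical region.

```python
import math


def critical_threshold(null_distances, alpha):
    """Monte Carlo stand-in for the critical value.

    Returns a value such that at most an alpha fraction of the simulated
    null distances lie strictly above it.
    """
    ordered = sorted(null_distances)
    allowed = math.floor(alpha * len(ordered))  # number of rejections we can afford
    return ordered[len(ordered) - allowed - 1]


def goodness_of_fit_test(observed_distance, null_distances, alpha):
    """Return 1 (reject H0) if the observed distance exceeds the threshold, else 0."""
    threshold = critical_threshold(null_distances, alpha)
    return 1 if observed_distance > threshold else 0


# Illustrative, made-up distances simulated under the null hypothesis.
nulls = [i / 10 for i in range(1, 11)]
print(critical_threshold(nulls, 0.2))          # 0.8
print(goodness_of_fit_test(0.95, nulls, 0.2))  # 1
print(goodness_of_fit_test(0.5, nulls, 0.2))   # 0
```

With a Monte Carlo threshold the Type I error is controlled only approximately, up to simulation error, whereas the test as defined controls it exactly.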



Definition 3 (Goodness-of-fit test). For each $n \in \mathbb{N}$ and $\alpha \in (0, 1)$ the
goodness-of-fit test $G^n_\alpha : A^n \to \{0, 1\}$ is defined as

$$G^n_\alpha(X_1, \dots, X_n) := \begin{cases} 1 & \text{if } (X_1, \dots, X_n) \in C^n_\alpha, \\ 0 & \text{otherwise.} \end{cases}$$

Theorem 1. The test $G^n_\alpha$ has the following properties.

(i) For every $\alpha \in (0, 1)$ and every $n \in \mathbb{N}$ the Type I error of the test is not
greater than $\alpha$: $p(G^n_\alpha = 1) \le \alpha$.

(ii) For every $\alpha \in (0, 1)$ the Type II error goes to 0 almost surely: for every
$p' \ne p$ we have $\lim_{n \to \infty} G^n_\alpha = 1$ with $p'$-probability 1.

Note that using an appropriate randomization in the definition of $C^n_\alpha$ we can
make the Type I error exactly $\alpha$.

Proof. The first statement holds by construction. To prove the second
statement, let the sample $X$ be generated by $p' \in S$, $p' \ne p$, and define $\delta = d(p, p')/2$.
By Lemma 1 we have $p(D^n_\delta) \to 0$, so that $p(D^n_\delta) < \alpha$ from some $n$ on; denote
it $n_1$. Thus, for $n > n_1$ we have $D^n_\delta \subset C^n_\alpha$. At the same time, by Lemma 1 we
have $d(X, p) > \delta$ from some $n$ on, which we denote $n_2(X)$, with $p'$-probability 1.
So, for $n > \max\{n_1, n_2(X)\}$ we have $X \in D^n_\delta \subset C^n_\alpha$, which proves
statement (ii). □

3.2 Process classification 

Let there be given three samples $X = (X_1, \dots, X_k)$, $Y = (Y_1, \dots, Y_m)$ and
$Z = (Z_1, \dots, Z_n)$. Each sample is generated by a stationary ergodic process
$p_X$, $p_Y$ and $p_Z$ respectively. Moreover, it is known that either $p_Z = p_X$ or
$p_Z = p_Y$, but $p_X \ne p_Y$. We wish to construct a test that, based on the finite
samples $X$, $Y$ and $Z$, will tell whether $p_Z = p_X$ or $p_Z = p_Y$.

The test chooses the sample $X$ or $Y$ according to whichever is closer to $Z$ in
$d$. That is, we define the test $L(X, Y, Z)$ as follows. If $d(X, Z) < d(Y, Z)$ then
the test says that the sample $Z$ is generated by the same process as the sample
$X$, otherwise it says that the sample $Z$ is generated by the same process as the
sample $Y$.

Definition 4 (Process classifier). Define the classifier $L : A^* \times A^* \times A^* \to \{1, 2\}$
as follows:

$$L(X, Y, Z) := \begin{cases} 1 & \text{if } d(X, Z) < d(Y, Z), \\ 2 & \text{otherwise,} \end{cases}$$

for $X, Y, Z \in A^*$.
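
The classifier can be sketched on top of a truncated empirical distance. As a simplifying assumption, the countable family of sets is replaced here by a small finite family (products of the half-open intervals between consecutive points of `cuts`, up to dimension `max_dim`) with weights $2^{-i}$; all function names are hypothetical.

```python
from itertools import product


def frequency(sample, box):
    """Frequency of windows of `sample` in a box of half-open intervals [lo, hi)."""
    k = len(box)
    windows = len(sample) - k + 1
    if windows <= 0:
        return 0.0
    hits = sum(
        all(lo <= x < hi for x, (lo, hi) in zip(sample[i:i + k], box))
        for i in range(windows)
    )
    return hits / windows


def empirical_distance(xs, ys, cuts=(0.0, 0.5, 1.0), max_dim=2):
    """Truncated empirical distance: finitely many boxes B_i, weights 2^{-i}."""
    intervals = list(zip(cuts[:-1], cuts[1:]))
    boxes = [b for k in range(1, max_dim + 1) for b in product(intervals, repeat=k)]
    return sum(
        2.0 ** -i * abs(frequency(xs, b) - frequency(ys, b))
        for i, b in enumerate(boxes, start=1)
    )


def classify(xs, ys, zs):
    """Return 1 if Z is closer to X than to Y in empirical distance, else 2."""
    return 1 if empirical_distance(xs, zs) < empirical_distance(ys, zs) else 2


xs = [0.2, 0.8] * 4            # alternating pattern
ys = [0.2, 0.2, 0.8, 0.8] * 2  # same one-dimensional frequencies, different pairs
zs = [0.8, 0.2] * 4            # alternating pattern again
print(classify(xs, ys, zs))    # 1
```

The toy samples are deterministic and far too short for the asymptotic guarantee to say anything; they only show the mechanics of the decision rule.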



Theorem 2. The test $L(X, Y, Z)$ makes only a finite number of errors when
$|X|$, $|Y|$ and $|Z|$ go to infinity, with probability 1: if $p_X = p_Z$ then $L(X, Y, Z) = 1$
from some $|X|, |Y|, |Z|$ on with probability 1; otherwise $L(X, Y, Z) = 2$ from
some $|X|, |Y|, |Z|$ on with probability 1.



Proof. From the fact that $d$ is a metric and from Lemma 1 we conclude that
$d(X, Z) \to 0$ (with probability 1) if and only if $p_X = p_Z$. So, if $p_X = p_Z$ then
by assumption $p_Y \ne p_Z$ and $d(X, Z) \to 0$ a.s. while

$$d(Y, Z) \to d(p_Y, p_Z) \ne 0.$$

Thus in this case $d(Y, Z) > d(X, Z)$ from some $|X|, |Y|, |Z|$ on with probability 1,
from which moment we have $L(X, Y, Z) = 1$. The opposite case is analogous. □

3.3 Change point problem 

The sample $Z = (Z_1, \dots, Z_n)$ consists of two concatenated parts $X = (X_1, \dots, X_k)$
and $Y = (Y_1, \dots, Y_m)$, where $m = n - k$, so that $Z_i = X_i$ for $1 \le i \le k$ and
$Z_{k+j} = Y_j$ for $1 \le j \le m$. The samples $X$ and $Y$ are generated
independently by two different stationary ergodic processes with alphabet $A = \mathbb{R}$. The
distributions of the processes are unknown. The value $k$ is called the change
point. It is assumed that $k$ is linear in $n$; more precisely, $\alpha n < k < \beta n$ for some
$0 < \alpha \le \beta < 1$ from some $n$ on.

It is required to estimate the change point $k$ based on the sample $Z$.

For each $t$, $1 \le t \le n$, denote by $U^t$ the sample $(Z_1, \dots, Z_t)$ consisting of the
first $t$ elements of the sample $Z$, and denote by $V^t$ the remainder $(Z_{t+1}, \dots, Z_n)$.

Definition 5 (Change point estimator). Define the change point estimate
$\hat{k} : A^* \to \mathbb{N}$ as follows:

$$\hat{k}(Z_1, \dots, Z_n) := \operatorname{argmax}_{t \in [\sqrt{n},\, n - \sqrt{n}]} d(U^t, V^t).$$
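
The estimator can be sketched by scanning all admissible split points and keeping the one that maximizes a truncated empirical distance between the two parts. As a simplifying assumption, the countable family of sets is replaced by a small finite family (products of the half-open intervals between consecutive points of `cuts`) with weights $2^{-i}$; the function names are hypothetical.

```python
import math
from itertools import product


def frequency(sample, box):
    """Frequency of windows of `sample` in a box of half-open intervals [lo, hi)."""
    k = len(box)
    windows = len(sample) - k + 1
    if windows <= 0:
        return 0.0
    hits = sum(
        all(lo <= x < hi for x, (lo, hi) in zip(sample[i:i + k], box))
        for i in range(windows)
    )
    return hits / windows


def empirical_distance(xs, ys, cuts=(0.0, 0.5, 1.0), max_dim=2):
    """Truncated empirical distance: finitely many boxes B_i, weights 2^{-i}."""
    intervals = list(zip(cuts[:-1], cuts[1:]))
    boxes = [b for k in range(1, max_dim + 1) for b in product(intervals, repeat=k)]
    return sum(
        2.0 ** -i * abs(frequency(xs, b) - frequency(ys, b))
        for i, b in enumerate(boxes, start=1)
    )


def change_point(zs):
    """Return the split t in [sqrt(n), n - sqrt(n)] maximizing d(U^t, V^t)."""
    n = len(zs)
    lo, hi = math.ceil(math.sqrt(n)), math.floor(n - math.sqrt(n))
    return max(range(lo, hi + 1), key=lambda t: empirical_distance(zs[:t], zs[t:]))


zs = [0.2] * 20 + [0.8] * 20  # toy sample with a change after 20 observations
print(change_point(zs))       # 20
```

This brute-force scan recomputes all frequencies for every split and is meant only to mirror the definition, not to be efficient.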

It should be noted that the term $\sqrt{n}$ in this definition can be replaced by
any $o(n)$ function that goes to infinity with $n$; this, in particular, does not affect
the theorem below. Alternative approaches used in the literature on the change 
point problem are to introduce weights near the ends of the sample, or to assume 
known linear bounds on the change point (see e.g. [5]). 

Theorem 3. For the estimate $\hat{k}$ of the change point $k$ we have

$$|\hat{k} - k| = o(n) \text{ a.s.}$$

where $n$ is the size of the sample, and when $k, n - k \to \infty$ in such a way that
$\alpha < k/n < \beta$ for some $\alpha, \beta \in (0, 1)$ from some $n$ on.

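Proof. To prove the statement, we will show that for every $\gamma$, $0 < \gamma < 1$,
with probability 1 the inequality $d(U^t, V^t) < d(X, Y)$ holds for each $t$ such that
$\sqrt{n} \le t \le \gamma k$, possibly except for a finite number of times. Thus we will show that
linear $\gamma$-underestimates occur only a finite number of times; for overestimates
the argument is analogous. Fix some $\gamma$, $0 < \gamma < 1$, and $\varepsilon > 0$. Let $J$ be big enough to have
$\sum_{i=J}^{\infty} w_i < \varepsilon/2$ and also big enough to have an index $j < J$ for which $p_X(B_j) \ne
p_Y(B_j)$. Take $M_\varepsilon \in \mathbb{N}$ large enough to have $|\nu(Y, B_i) - p_Y(B_i)| < \varepsilon/2J$ for
all $m > M_\varepsilon$ and for each $i$, $1 \le i \le J$, and also to have $|B_j|/m < \varepsilon/J$. This
is possible since empirical frequencies converge to the limiting probabilities a.s.
(that is, $M_\varepsilon$ depends on the realization $Y_1, Y_2, \dots$) (cf. the proof of Lemma 1).
Observe that the distribution of the sample $X_s, X_{s+1}, \dots, X_k$, where $s$ is chosen
independently of the sample, is governed by the same stationary ergodic process
as $X_1, \dots, X_k$. Therefore, we can find a $K_\varepsilon$ (that depends on $X$) such that for
all $k > K_\varepsilon$ and for all $i$, $1 \le i \le J$, we will have $|\nu(U^t, B_i) - p_X(B_i)| < \varepsilon/2J$ for
each $t \ge \sqrt{n}$, and $|\nu((X_s, X_{s+1}, \dots, X_k), B_i) - p_X(B_i)| < \varepsilon/2J$ for each $s \le \gamma k$.
So, for each $s \in [\sqrt{n}, \gamma k]$ we have

$$\begin{aligned}
&\left| \nu(V^s, B_j) - \frac{(1-\gamma) k\, p_X(B_j) + m\, p_Y(B_j)}{(1-\gamma) k + m} \right| \\
&\quad \le \left| \frac{(1-\gamma) k\, \nu((X_s, \dots, X_k), B_j) + m\, \nu(Y, B_j)}{(1-\gamma) k + m}
 - \frac{(1-\gamma) k\, p_X(B_j) + m\, p_Y(B_j)}{(1-\gamma) k + m} \right|
 + \frac{|B_j|}{m + \gamma k} \\
&\quad \le 3\varepsilon/J,
\end{aligned}$$

for $k > K_\varepsilon$ and $m > M_\varepsilon$ (from the definitions of $K_\varepsilon$ and $M_\varepsilon$). Hence

$$\begin{aligned}
&|\nu(X, B_j) - \nu(Y, B_j)| - |\nu(U^s, B_j) - \nu(V^s, B_j)| \\
&\quad \ge |\nu(X, B_j) - \nu(Y, B_j)|
 - \left| p_X(B_j) - \frac{(1-\gamma) k\, p_X(B_j) + m\, p_Y(B_j)}{(1-\gamma) k + m} \right| - 3\varepsilon/J \\
&\quad \ge |p_X(B_j) - p_Y(B_j)|
 - \left| p_X(B_j) - \frac{(1-\gamma) k\, p_X(B_j) + m\, p_Y(B_j)}{(1-\gamma) k + m} \right| - 4\varepsilon/J \\
&\quad = \delta_j - 4\varepsilon/J,
\end{aligned}$$

for some $\delta_j$ that depends only on $k/m$ and $\gamma$. Summing over all $B_i$, $i \in \mathbb{N}$, we
get

$$d(X, Y) - d(U^s, V^s) \ge w_j \delta_j - 5\varepsilon,$$

for all $n$ such that $k > K_\varepsilon$ and $m > M_\varepsilon$, which is positive for small enough $\varepsilon$. □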
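
The positivity of the constant in the last step can be checked directly. The short derivation below assumes that $\delta_j$ denotes the difference of the two absolute values appearing in the proof, and writes $A_j$ (notation introduced here) for the weighted average of $p_X(B_j)$ and $p_Y(B_j)$.

```latex
A_j := \frac{(1-\gamma) k\, p_X(B_j) + m\, p_Y(B_j)}{(1-\gamma) k + m},
\qquad
p_X(B_j) - A_j = \frac{m \left( p_X(B_j) - p_Y(B_j) \right)}{(1-\gamma) k + m},
\\[1ex]
\delta_j = |p_X(B_j) - p_Y(B_j)| - |p_X(B_j) - A_j|
         = |p_X(B_j) - p_Y(B_j)| \cdot \frac{(1-\gamma) k}{(1-\gamma) k + m} > 0 .
```

Under this reading, $\delta_j$ depends on $k$ and $m$ only through the ratio $k/m$, and it stays bounded away from 0 because $k/n$ is bounded below by $\alpha$, so a fixed small $\varepsilon$ works for all large $n$.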